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ABSTRACT 

Robust  low-rank  matrix  estimation  is  a  topic  of  increasing  interest, 
with  promising  applications  in  a  variety  of  fields,  from  computer 
vision  to  data  mining  and  recommender  systems.  Recent  theoreti¬ 
cal  results  establish  the  ability  of  such  data  models  to  recover  the 
true  underlying  low-rank  matrix  when  a  large  portion  of  the  mea¬ 
sured  matrix  is  either  missing  or  arbitrarily  corrupted.  However,  if 
low  rank  is  not  a  hypothesis  about  the  true  nature  of  the  data,  but 
a  device  for  extracting  regularity  from  it,  no  current  guidelines  ex¬ 
ist  for  choosing  the  rank  of  the  estimated  matrix.  In  this  work  we 
address  this  problem  by  means  of  the  Minimum  Description  Length 
(MDL)  principle  -  a  well  established  information-theoretic  approach 
to  statistical  inference  -  as  a  guideline  for  selecting  a  model  for  the 
data  at  hand.  We  demonstrate  the  practical  usefulness  of  our  formal 
approach  with  results  for  complex  background  extraction  in  video 
sequences. 

Index  Terms —  Low-rank  matrix  estimation,  PCA,  Robust 
PC  A,  MDL. 

1.  INTRODUCTION 

The  key  to  success  in  signal  processing  applications  often  depends 
on  incorporating  the  right  prior  information  about  the  data  into  the 
processing  algorithms.  In  matrix  estimation,  low-rank  is  an  all-time 
popular  choice,  with  analysis  tools  such  as  Principal  Component 
Analysis  (PCA)  dominating  the  field.  However,  PCA  estimation  is 
known  to  be  non-robust,  and  developing  robust  alternatives  is  an  ac¬ 
tive  research  field  (see  [  ]  for  a  review  on  low-rank  matrix  estima¬ 
tion).  In  this  work,  we  focus  on  a  recent  robust  variant  of  PCA, 
coined  RPC  A  [  ],  which  assumes  that  the  difference  between  the 
observed  matrix  Y,  and  the  true  underlying  data  X,  is  a  sparse  ma¬ 
trix  E  whose  non-zero  entries  are  arbitrarily  valued.  It  has  been 
shown  in  [  ]  that  X  (alternatively,  E)  can  be  recovered  exactly  by 
means  of  a  convex  optimization  problem  involving  the  rank  of  Y 
and  the  I\  norm  of  E.  The  power  of  this  approach  has  been  re¬ 
cently  demonstrated  in  a  variety  of  applications,  mainly  computer  vi¬ 
sion  (see  [  ]  and  http://perception.csl.uiuc.edu/matrix-rank/ 
appiications.html  for  examples). 

However,  when  used  as  a  pure  data  modeling  tool,  with  no  as¬ 
sumed  “true”  underlying  signal,  the  rank  of  X  in  a  PCA/RPCA  de¬ 
composition  is  a  parameter  to  be  tuned  in  order  to  achieve  some 
desired  goal.  A  typical  case  is  model  selection  [3,  Chapter  7],  where 
one  wants  to  select  the  size  of  the  model  (in  this  case,  rank  of  the  ap¬ 
proximation)  in  order  to  strike  an  optimal  balance  between  the  ability 
of  the  estimated  model  to  generalize  to  new  samples,  and  its  ability  to 
adapt  itself  to  the  observed  data  (the  classic  overfitting/underfitting 
trade-off  in  statistics).  The  main  issue  in  model  selection  is  how  to 
formulate  this  balance  as  a  cost  function. 
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In  this  work,  we  address  this  issue  via  the  Minimum  Description 
Length  (MDL)  principle  [4,  5] . 1  MDL  is  a  general  methodology  for 
assessing  the  ability  of  statistical  models  to  capture  regularity  from 
data.  The  MDL  principle  can  be  regarded  as  a  practical  implemen¬ 
tation  of  the  Occam’s  razor  principle,  which  states  that,  given  two 
descriptions  for  a  given  phenomenon,  the  shorter  one  is  usually  the 
best.  In  a  nutshell,  MDL  equates  “ability  to  capture  regularity”  with 
“ability  to  compress”  the  data,  using  codelength  or  compressibility 
as  the  metric  for  measuring  candidate  models. 

The  resulting  framework  provides  a  robust,  parameter-free  low- 
rank  matrix  selection  algorithm,  capable  of  capturing  relevant  low- 
rank  information  in  the  data,  as  in  the  video  sequences  from  surveil¬ 
lance  cameras  in  the  illustrative  application  here  reported.  From  a 
theoretical  standpoint,  this  brings  a  new,  information  theoretical  per¬ 
spective  into  the  problem  of  low-rank  matrix  completion.  Another 
important  feature  of  an  MDL-based  framework  such  as  the  one  here 
presented  is  that  new  prior  information  can  be  naturally  and  easily 
incorporated  into  the  problem,  and  its  effect  can  be  assessed  objec¬ 
tively  in  terms  of  the  different  codelengths  obtained. 

2.  LOW-RANK  MATRIX  ESTIMATION/ APPROXIMATION 

Under  the  low-rank  assumption,  a  matrix  Y  £  RmXr“  can  be  written 
as  Y  =  X  +  E,  where  rank(X)  -C  min{m,  n}  and  ||E||  <C  ||  Y||, 
where  ||  •  ||  is  some  matrix  norm.  Classic  PCA  provides  the  best  rank- 
k  approximation  to  Y  under  the  assumption  that  E  is  a  random  ma¬ 
trix  with  zero-mean  IID  Gaussian  entries, 

X  =  argmin  ||Y  —  W|L  ,  s.t.  rank(W)  <  k.  (1) 

w 

However,  PCA  is  known  to  be  non-robust,  meaning  that  the  estimate 
X  can  vary  significantly  if  only  a  few  coefficients  in  E  are  modified. 
This  work,  providing  an  example  of  introducing  the  MDL  frame¬ 
work  in  this  type  of  problems,  focuses  on  a  robust  variant  of  PCA, 
RPC  A,  introduced  in  [  ].  RPC  A  estimates  X  via  the  following  con¬ 
vex  optimization  problem, 

X  =  argmm||Y-W||1  +  A||W||t,  (2) 

where  ||W||t  :=  JY  <r(W)j  is  the  nuclear  norm  of  W  (cr(W)i  de¬ 
notes  the  i-th  singular  value  of  W).  The  rationale  behind  (2)  is  as 
follows.  First,  the  I\  fitting  term  allows  for  large  errors  to  occur  in 
the  approximation.  In  this  sense,  it  is  a  robust  alternative  to  the  li 
norm  used  in  PCA.  The  second  term,  A  ||  W||t  ,  is  a  convex  approx¬ 
imation  to  the  PCA  constraint  rank(W)  <  k,  merged  into  the  cost 
function  via  a  Lagrange  multiplier  A. 

This  formulation  has  been  recently  shown  to  be  notoriously  ro¬ 
bust,  in  the  sense  that,  if  a  true  low-rank  matrix  X  exists,  it  can 

1  While  here  we  address  the  matrix  formulation,  the  developed  framework 
is  applicable  in  general,  including  to  sparse  models,  and  such  general  formu¬ 
lation  will  be  reported  in  our  extended  version  of  this  work. 


be  recovered  using  (2)  even  when  a  significant  amount  of  coeffi¬ 
cients  in  E  are  arbitrarily  large  [  ].  This  can  be  achieved  by  setting 
A  =  1  / yj max{m,  n},  so  that  the  procedure  is  parameter-free. 


2.1.  Low-rank  approximation  as  dimensionality  reduction 

In  many  applications,  the  goal  of  low-rank  approximation  is  not  to 
find  a  “true”  underlying  matrix  X,  but  to  perform  what  is  known 
as  “dimensionality  reduction,”  that  is,  to  obtain  a  succinct  represen¬ 
tation  of  Y  in  a  lower  dimensional  subspace.  A  typical  example  is 
feature  selection  for  classification.  In  such  cases,  E  is  not  necessarily 
a  small  measurement  perturbation,  but  a  systematic ,  possibly  large, 
error  derived  from  the  approximation  process  itself.  Thus,  RPCA 
arises  as  an  appealing  alternative  for  low-rank  approximation. 

However,  in  the  absence  of  a  true  underlying  signal  X  (and  de¬ 
viation  E),  it  is  not  clear  how  to  choose  a  value  of  A  that  produces 
a  good  approximation  of  the  given  data  Y  for  a  given  application. 
A  typical  approach  would  involve  some  cross-validation  step  to  se¬ 
lect  A  to  maximize  the  final  results  of  the  application  (for  example, 
minimize  the  error  rate  in  a  classification  problem). 

The  issue  with  cross-validation  in  this  situation  is  that  the  best 
model  is  selected  indirectly  in  terms  of  the  final  results,  which  can 
depend  in  unexpected  ways  on  later  stages  in  the  data  processing 
chain  of  the  application  (for  example,  on  some  post-processing  of  the 
extracted  features).  Instead,  we  propose  to  select  the  best  low-rank 
approximation  by  means  of  a  direct  measure  on  the  intrinsic  abil¬ 
ity  of  the  resulting  model  to  capture  the  desired  regularity  from  the 
data,  this  also  providing  a  better  understanding  of  the  actual  struc¬ 
ture  of  the  data.  To  this  end,  we  use  the  MDL  principle,  a  general 
information-theoretic  framework  for  model  selection  which  provides 
means  to  define  such  a  direct  measure. 


3.  MDL-BASED  LOW-RANK  MODEL  SELECTION 

Consider  a  family  A4  of  candidate  models  which  can  be  used  to 
describe  a  matrix  Y  exactly  (that  is,  losslessly)  using  some  encod¬ 
ing  procedure.  Denote  by  L(Y\M)  the  description  length,  in  bits, 
of  Y  under  the  description  provided  by  a  given  model  M  £  A4. 
MDL  will  then  select  the  model  M  £  M  for  Y  for  which  L(Y\M) 
is  minimal,  that  is  M  =  arg min,vf 6,vi  L(Y\M).  It  is  a  standard 
practice  in  MDL  to  use  the  ideal  Shannon  code  for  translating  prob¬ 
abilities  into  codelengths.  Under  this  scheme,  a  sample  value  y  with 
probability  P(y)  is  assigned  a  code  with  length  L(y)  =  —  log  P(y) 
(all  logarithms  are  taken  on  base  2).  This  is  called  an  ideal  code  be¬ 
cause  it  only  specifies  a  codelength,  not  a  specific  binary  code,  and 
because  the  codelengths  produced  can  be  fractional. 

By  means  of  the  Shannon  code  assignment,  encoding  schemes 
!/(•)  can  be  defined  naturally  in  terms  of  probability  models  P(-). 
Therefore,  the  art  of  applying  MDL  lies  in  defining  appropriate  prob¬ 
ability  assignments  P(-),  that  exploit  as  much  prior  information  as 
possible  about  the  data  at  hand,  in  order  to  maximize  compressibil¬ 
ity.  In  our  case,  there  are  two  main  components  to  exploit.  One  is  the 
low-rank  nature  of  the  approximation  X,  and  the  other  is  that  most 
of  the  entries  in  E  will  be  small,  or  even  zero  (in  which  case  E  will 
be  sparse).  Given  a  low-rank  approximation  X  of  Y,  we  describe  Y 
as  the  pair  (X,  E),  with  E  =  Y  —  X.  Thus,  our  family  of  models  is 
given  by  M  =  {(X,E)  :  Y  =  X  +  E,  rank(X)  <  rank(Y)}.  As 
E  =  Y  —  X,  we  index  A4  solely  by  X.  With  these  definitions,  the 
description  codelength  of  Y  is  given  by  L(YjX)  :=  L(X)  +  L(E). 
Now,  to  exploit  the  low  rank  of  X,  we  describe  it  in  terms  of  its 


reduced  SVD  decomposition, 

X  =  UEVT  U  £  Rmxfe,  E  £  M.kxk:  V  £  RfeXn,  (3) 

where  k  is  the  rank  of  X  (the  zero-eigenvalues  and  the  respective 
left  and  right  eigenvectors  are  discarded  in  this  description).  We  now 
have  L(X)  =  L(U)  +  L(E)  +  L(V).  Clearly,  such  description  will 
be  short  if  rank(X)  is  significantly  smaller  than  max{m,  n}.  We 
may  also  be  able  to  exploit  further  structure  in  U,  E  and  V. 


3.1.  Encoding  E 

The  diagonal  of  E  is  a  non-increasing  sequence  of  k  positive  val¬ 
ues.  However,  no  safe  assumption  can  be  made  about  the  magnitude 
of  such  values.  For  this  scenario  we  propose  to  use  the  universal 
prior  for  integers ,  a  general  scheme  for  encoding  arbitrary  positive 
integers  in  an  efficient  way  [6], 

L(j)  =  log*  j  ■■=  log  j  +  log  log  j  +  . . .  +  log  2.865,  (4) 

where  the  sum  stops  at  the  first  non-positive  summand,  and  log  2.865 
is  added  to  satisfy  Kraft’s  inequality  (a  requirement  for  the  code  to 
be  uniquely  decodable,  see  [7,  Chapter  5]).  In  order  to  apply  (4), 
the  diagonal  of  E,  diag(E),  is  mapped  to  an  integer  sequence  via 
[1016diag(E)],  where  [■]  denotes  rounding  to  nearest  integer  (this  is 
equivalent  to  quantizing  diag(E)  with  precision  Ss  =  10~16). 


3.2.  Encoding  U  and  V,  general  case 


By  virtue  of  the  SVD  algorithm,  the  columns  of  U  and  V  have  unit 
norm.  Therefore,  the  most  general  assumption  we  can  make  about 
U  and  V  is  that  their  columns  lie  on  the  respective  m-dimensional 
and  rt-dimensional  unit  spheres. 

An  efficient  code  for  this  case  can  be  obtained  by  encoding  each 
column  of  U  and  V  in  the  following  manner.  Let  u,  be  a  column  of 
U  (V  is  similarly  encoded).  Since  is  assumed  to  be  distributed 
uniformly  over  the  m-dimensional  unit  sphere,  the  marginal  cumula¬ 
tive  density  function  of  the  first  element  uu,  F{uu)  =  P(x  <  uu), 
corresponds  to  the  proportion  of  vectors  u  that  lie  on  the  unit  spher¬ 
ical  cap  of  height  h  =  1  +  U],  (see  Figure  1(a)).  This  proportion 
is  given  by  F(uu)  =  Am{  1  +  uu,  where  Am(h,r)  and 

Sm(r)  are  the  area  of  spherical  cap  of  height  h  and  the  total  surface 
area  of  the  m-dimensional  sphere  of  radius  r  respectively.  These  are 
given  for  the  case  0  <  h  <  r  (—  1  <  uu  <  0)  by  (see  [8]), 

Am(h,  r)  =  ^Sm(r)I((2hr  —  h2)/r2  ;  — 

Sm{r)  =  27rm/2rm_1r~1(m/2), 


where  I(x  ;  a,b )  =  J°  ‘  ^ - — ,  and  B(a,b)  =  fgta  x(l- 

t)b~1dt  are  the  regularized  incomplete  Beta  function  and  the  Beta 
function  of  parameters  a,  b  respectively,  and  T(-)  is  the  Gamma 
function.  When  r  <  h  <  ’2r  we  simply  have  Am{h,r)  =  1  — 
Am(2r  —  h,r).  For  encoding  uu  we  have  r  =  1  so  that 

F(Uli)  =  (1/2)7(1  -  u2u;  (m  -  l)/2, 1/2), -1  <  uu  <  0,  (5) 


since  2 h  —  h2  =  h( 2  —  h)  =  (1  +  uu)[2  —  (1  +  «i;)]  =  1  —  Uu. 
Finally,  we  compute  the  Shannon  codelength  for  uu  as 


^  ^  (a)  (1  -  3)/2(w?i)  1/2 ,  0_  ^ 

p(wii)  —  F  (uu)  —  p  fm-l  1\  (  -U~Li) 

*  '  ^  V  2  ’2  ) 

=  — sgn(wii)(l  -  M2i)(m_3,/2[B((rn  -  l)/2, 1/2)]" 
-logp(wri)  =  -m  3  log(l  -  Uu)  +  log B((m  -  l)/2, 1/2), 
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Fig.  1.  (a)  The  spherical  cap  of  radius  r  and  height  h  (shown  in 

gray),  (b)  Causal  bilinear  prediction  of  smooth  2D  images. 


where  in  (a)  we  applied  the  Fundamental  Theorem  of  Calculus  to 
the  definition  of  F(h)  and  the  chain  rule  for  derivatives. 

With  un  encoded,  the  vector  ( U2i,U3i , . . .  ,umt )  is  uniformly 
distributed  on  the  surface  of  the  ( m  —  1) -dimensional  sphere  of  ra¬ 
dius  r'  —  1  —  |itii|,  and  we  can  apply  the  same  formula  to  compute 
the  probability  of  u2i,  F(u2i )  =  Am_ i(w2i  +  r1  ,r')/ Sm-i(r'). 

Finally,  to  encode  the  next  column  Ui+i,  we  can  exploit  its  or¬ 
thogonality  with  respect  to  the  previous  ones  and  encode  it  as  a  vec¬ 
tor  distributed  uniformly  over  the  m  —  i  dimensional  sphere  cor¬ 
responding  to  the  intersection  of  the  unit  sphere  and  the  subspace 
perpendicular  to  [ui, ....  u,]. 

In  order  to  produce  finite  descriptions  L( U)  and  L(V),  both 
U  and  V  also  need  to  be  quantized.  We  choose  the  quantization 
steps  for  U  and  V  adaptively,  using  as  a  starting  point  the  empirical 
standard  deviation  of  a  normalized  vector,  that  is,  5U  =  \J 1/m  and 
5V  =  y/l/n  respectively,  and  halving  these  values  until  no  further 
decrease  in  the  overall  codelength  L(YjX)  is  observed. 

3.3.  Encoding  U  predictively 

If  more  prior  information  about  U  and  V  is  available,  it  should  be 
used  as  well.  For  example,  in  the  case  of  our  example  application, 
the  columns  of  Y  are  consecutive  frames  of  a  video  surveillance 
camera.  In  this  case,  the  columns  of  U  represent  “eigen-frames” 
of  the  video  sequence,  while  V  contains  information  about  the  evo¬ 
lution  in  time  of  those  frames  (this  is  clearly  observed  in  figures  2 
and  3).  Therefore,  the  columns  of  U  can  be  assumed  to  be  piecewise 
smooth,  just  as  normal  static  images  are.  To  exploit  this  smoothness, 
we  apply  a  predictive  coding  to  the  columns  of  U.  Concretely,  to 
encode  the  i-th  column  u,  of  U,  we  reshape  it  as  an  image  B  of 
the  same  size  as  the  original  frames  in  Y.  We  then  apply  a  causal 
bilinear  predictor  to  produce  an  estimate  of  B,  B  =  {bji}  where 
bji  =  bji  -  bj(i- 1)  -  b(j_i)j  +  b(j_ i)(i_i),  assuming  out-of-range 
pixels  to  be  0.  The  prediction  residual  B  =  B  B  is  then  en¬ 
coded  in  raster  scan  as  a  sequence  of  Laplacian  random  variables 
with  unknown  parameter  9lu.  This  encoding  procedure,  common  in 
predictive  coding,  is  depicted  in  Figure  1(b). 

Since  the  parameters  {9lu,  i  =  1, .  . . ,  fcj  are  unknown,  we  need 
to  encode  them  as  well  to  produce  a  complete  description  of  Y.  In 
MDL,  this  is  done  using  the  so-called  universal  encoding  schemes , 
which  can  be  regarded  as  a  generalization  of  classical  Shannon  en¬ 
coding  to  the  case  of  distributions  with  unknown  parameters  (see  [5] 
for  a  review  on  the  subject).  In  this  work  we  adopt  the  so-called 
universal  two-part  codes,  and  apply  it  to  encode  each  column  u; 
separately.  Under  this  scheme,  the  unknown  Laplacian  parameter 
for  9lu  is  estimated  via  Maximum  Likelihood,  6^(ui),  and  quantized 
with  precision  1/y/m,  thus  requiring  L(9[l)  =  |  logm  +  ci  bits. 


Given  the  quantized  9 „,  u;  is  described  using  the  discretized  Lapla¬ 
cian  distribution  L(u,:)  =  —  log  P(u;j0^(ui))  +  c2.  Here  ci  and  c2 
are  constants  which  can  be  disregarded  for  optimization  purposes.  It 
was  shown  in  [4]  that  the  precision  1  /  y/m  asymptotically  yields  the 
shortest  two-parts  codelength. 


3.4.  Encoding  V  predictively 

We  also  expect  a  significant  redundancy  in  the  time  dimension,  so 
that  the  columns  of  V  are  also  smooth  functions  of  time  (in  this 
case,  sample  index  j  =  1,  2, . . . ,  n).  In  this  case,  we  apply  a  first  or¬ 
der  causal  predictive  model  to  the  columns  of  V,  by  encoding  them 
as  sequences  of  prediction  residuals,  v,  =  (vn,  Vi2, . . . ,  Vi„),  with 
Vij  =  Vij—Vi(j- 1)  for  j  >  1  andFn  =  vn.  Each  predicted  column 
V;  is  encoded  as  a  sequence  of  Laplacian  random  variables  with  un¬ 
known  parameter  9lv.  As  with  U,  we  use  a  two-parts  code  here  to 
describe  the  data  and  the  unknown  Laplacian  parameters  together. 
This  time,  since  the  length  of  the  columns  is  n,  the  codelength  asso¬ 
ciated  to  each  91  is  L(9‘v)  =  ^  log  n. 


3.5.  Encoding  E 

We  exploit  the  (potential)  sparsity  of  E  by  first  describing  the  in¬ 
dexes  of  its  non-zero  locations  using  an  efficient  universal  two-parts 
code  for  Bernoulli  sequences  known  as  Enumerative  Code  [9],  and 
then  the  non-zero  values  at  those  locations  using  a  Laplacian  model. 
In  the  specific  case  of  the  experiments  of  Section  4,  we  encode  each 
row  of  E  separately.  Because  each  row  of  E  corresponds  to  the  pixel 
values  at  a  fixed  location  across  different  frames,  we  expect  some  of 
these  locations  to  be  better  predicted  than  others  (for  example,  loca¬ 
tions  which  are  not  occluded  by  people  during  the  sequences),  so  that 
the  variance  of  the  error  (hence  the  Laplacian  parameter)  will  vary 
significantly  from  row  to  row.  As  before,  the  unknown  parameters 
here  are  dealt  with  using  a  two-parts  coding  scheme. 


3.6.  Model  selection  algorithm 

To  obtain  the  family  of  models  M  corresponding  to  all  possible  low- 
rank  approximations  of  Y,  we  apply  the  RPCA  decomposition  (2) 
for  a  decreasing  sequence  of  values  of  A,  {At  :  t  —  1,2,...}  ob¬ 
taining  a  corresponding  sequence  of  decompositions  {(Xt,  Et),  t  = 
1,2,...}.  We  obtain  such  sequence  efficiently  by  solving  (2)  via  a 
simple  modification  of  the  Augmented  Lagrangian-based  (ALM)  al¬ 
gorithm  proposed  in  [10]  to  allow  for  warm  restarts,  that  is,  where 
the  initial  ALM  iterate  for  computing  (Xt,Et)  is  (Xt_i,Et-i). 
We  then  select  the  pair  (X;,  E(-),  t  =  argmint{L(Xt)  +  L(Et)}. 

4.  RESULTS  AND  CONCLUSION 

In  order  to  have  a  reference  base,  we  repeated  the  experiments  per¬ 
formed  in  [2]  using  our  algorithm.  These  experiments  consist  of 
frames  from  surveillance  cameras  which  look  at  a  fixed  point  where 
people  pass  by.  The  idea  is  that,  if  frames  are  stacked  as  columns 
of  Y,  the  background  can  be  well  modeled  as  a  low-rank  compo¬ 
nent  of  Y  (X),  while  the  people  passing  by  appear  as  “spurious  er¬ 
rors”  (E).  Clearly,  if  the  background  in  all  frames  is  the  same,  it 
can  be  very  well  modeled  as  a  rank-1  matrix  where  all  the  columns 
are  equal.  However,  lighting  changes,  shadows,  and  reflections, 
“raise”  the  rank  of  the  background,  and  the  appropriate  rank  needed 
to  model  the  background  is  no  longer  obvious. 


Fig.  2.  Results  for  the  “Lobby”  sequence  (see  text  for  a  description 
of  the  above  pictures  and  graphs).  The  rank  of  the  approximation 
decomposition  for  this  case  is  k  =  10.  The  moment  where  the  lights 
are  turned  off  is  clearly  seen  here  as  the  “square  pulse”  in  the  middle 
of  the  first  two  right-eigenvectors  (bottom-right  figure).  Also  note 
how  U2  (top-right)  compensates  for  changes  in  shadows. 


Fig.  3.  Results  for  the  “ShoppingMall”  sequence  (see  text  for  a  de¬ 
scription  of  the  above  pictures  and  graphs).  In  this  case,  the  rank 
of  the  approximation  decomposition  is  k  =  7.  Here,  the  first  left- 
eigenvector  models  the  background,  whereas  the  rest  tend  to  capture 
people  that  stood  still  for  a  while.  Here  we  see  the  “phantom”  of  two 
such  persons  in  the  second  left-eigenvector  (top-right). 


Concretely,  the  experiments  here  described  correspond  to  two 
sequences:  “Lobby”  and  “ShoppingMall,”  whose  corresponding  re¬ 
sults  are  summarized  respectively  in  figures  2  and  3. 2  At  the  top 
of  both  figures,  the  first  two  left-eigenvectors  Ui  and  U2  of  X  are 
shown  as  2D  images.  The  middle  shows  two  sample  frames  of  the 
error  approximation.  The  L-vs-A  curve  is  shown  at  the  bottom-left 
(note  that  the  best  A  is  not  the  one  dictated  by  the  theory  in  [  ],  which 
are  0.007  for  Lobby  and  0.0035  for  ShoppingMall,  both  outside  of 
the  plotted  range),  and  the  scaled  right-eigenvectors  ctiVi  are  shown 
on  the  bottom-right.  In  both  cases,  the  resulting  decomposition  re¬ 
covered  the  low-rank  structure  correctly,  including  the  background, 
its  changes  in  illumination,  and  the  effect  of  shadows.  It  can  be  ap¬ 
preciated  in  the  figures  2-3  how  such  approximations  are  naturally 
obtained  as  combinations  of  a  few  significant  eigen- vectors,  starting 
with  the  average  background,  followed  by  other  details. 

4.1.  Conclusion 

In  summary,  we  have  presented  an  MDL-based  framework  for  low- 
rank  data  approximation,  which  combines  state-of-the-art  algorithms 
for  robust  low-rank  decomposition  with  tools  from  information  the¬ 
ory.  This  framework  is  able  to  capture  the  underlying  low-rank  in¬ 
formation  on  the  experiments  that  we  performed,  out  of  the  box, 
and  without  any  hand  parameter  tuning,  thus  constituting  a  promis¬ 
ing  competitive  alternative  for  automatic  data  analysis  and  feature 
extraction. 


2The  full  videos  can  be  viewed  at  http://www.tc.umn.edu/ 
-nacho/lowrank/. 
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